To perform a cluster analysis in R, generally, the data should be prepared as follows:
Rows are observations (individuals) and columns are variables.
Any missing value in the data must be removed or estimated.
The data must be standardized (i.e., scaled) to make variables comparable.
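A minimal sketch of these preparation steps in base R (the data frame `df` and its column names are placeholders, not the objects used later in this post):

```r
# Toy data frame standing in for the real data (placeholder names)
df <- data.frame(a = c(1, 2, NA, 4),
                 b = c(10, 20, 30, 40))

df <- na.omit(df)       # remove rows with missing values
df_scaled <- scale(df)  # standardize each column to mean 0, sd 1
```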
I group the data by medium & year to compute the mean value for each p_group:
##                    Bündnis 90/ Die Grüne     CDU/CSU         FDP Linke/PDS/WASG         SPD
## BamS                         -0.08350076 -0.02256243 -0.03682351    -0.17131152 -0.08143064
## Bericht aus Berlin           -0.11080468 -0.11537941 -0.15316751    -0.14847533 -0.12518081
## Berlin direkt                -0.09875372 -0.08490027 -0.08889407    -0.15777266 -0.09758545
## Berliner                     -0.06484421 -0.08807705 -0.06177991    -0.08150096 -0.07650623
## Bild                         -0.10511309 -0.04832445 -0.04318226    -0.16098418 -0.08496746
## Die Welt                     -0.10129520 -0.04735683 -0.03924910    -0.12364047 -0.09560428
##                    Bündnis 90/ Die Grüne     CDU/CSU          FDP Linke/PDS/WASG         SPD
## BamS                        -0.007828121 -0.01052960 -0.005050128   -0.003450739 -0.02298595
## Bericht aus Berlin          -0.010457562 -0.04776818 -0.026401083   -0.011988974 -0.02985531
## Berlin direkt               -0.009451222 -0.03705477 -0.012294661   -0.009969052 -0.02599203
## Berliner                    -0.012230058 -0.02979307 -0.005277301   -0.005272176 -0.02471322
## Bild                        -0.008668780 -0.02165700 -0.004579914   -0.004835171 -0.02831759
## Die Welt                    -0.013492593 -0.01908624 -0.003600620   -0.005276756 -0.03148756
m.unweighted <- na.omit(m.unweighted)
m.weighted <- na.omit(m.weighted)

K-means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e., k clusters), where k represents the number of groups pre-specified by the analyst. It classifies objects into multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.
The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized. There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:
\[ W(C_k)=\sum_{x_i\in C_k}(x_i-\mu_k)^2 \]

where \(x_i\) is a data point belonging to the cluster \(C_k\) and \(\mu_k\) is the mean value of the points assigned to the cluster \(C_k\).

Each observation (\(x_i\)) is assigned to a cluster such that the sum-of-squares (SS) distance of the observation to its assigned cluster center (\(\mu_k\)) is minimized.
The objective function to be minimized is the total within-cluster sum of squares:

\[ \text{tot.withinss} = \sum^K_{k=1}W(C_k)=\sum^K_{k=1}\sum_{x_i\in C_k}(x_i-\mu_k)^2 \]

### K-means Algorithm
The k-means algorithm can be summarized as follows:

1. Specify the number of clusters (K) to be created (by the analyst).
2. Randomly select k objects from the data set as the initial cluster centers or means.
3. Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid.
4. For each of the k clusters, update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of the \(k\)th cluster is a vector of length \(p\) containing the means of all variables for the observations in the \(k\)th cluster, where \(p\) is the number of variables.
5. Iteratively minimize the total within-cluster sum of squares (equation above). That is, iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached.
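The steps above can be sketched with base R's `kmeans()` on toy data (the matrix `x` below is a stand-in for the standardized data used in this post, not the real data); the sketch also recomputes the objective function by hand to show that it matches `tot.withinss`:

```r
set.seed(123)  # k-means uses random initial centers, so fix the seed

# Toy standardized matrix standing in for the real data (20 obs., 3 vars)
x <- scale(matrix(rnorm(60), ncol = 3))

# Steps 1-5: k = 3 clusters, 25 random starts, keep the best solution
km <- kmeans(x, centers = 3, nstart = 25)

# Recompute the objective by hand: for each cluster, the sum of squared
# distances of its points to the cluster mean (W(C_k)), summed over k
manual <- sum(sapply(1:3, function(k) {
  pts <- x[km$cluster == k, , drop = FALSE]
  sum(sweep(pts, 2, colMeans(pts))^2)
}))
# `manual` agrees with km$tot.withinss up to floating-point error
```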
The output of `kmeans()` is a list with several bits of information, the most important being:

- `cluster`: a vector of integers (from 1 to k) indicating the cluster to which each observation is allocated.
- `centers`: a matrix of cluster centers.
- `totss`: the total sum of squares.
- `withinss`: a vector of within-cluster sums of squares, one component per cluster.
- `tot.withinss`: the total within-cluster sum of squares, i.e. `sum(withinss)`.
- `betweenss`: the between-cluster sum of squares, i.e. `totss - tot.withinss`.
- `size`: the number of observations in each cluster.
If we print the results we’ll see that our groupings resulted in three clusters of sizes 29, 49, and 287. We see the cluster centers (means) for the three groups across the five variables (Bündnis 90/ Die Grüne, CDU/CSU, FDP, Linke/PDS/WASG, SPD). We also get the cluster assignment for each observation (e.g., BamS was assigned to cluster 3 in year 2001, Bericht aus Berlin was assigned to cluster 1 in 2005, etc.).
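The individual pieces can also be pulled out of the fitted object directly. A small sketch on toy data (in the post, `km` would be the fit on the media data):

```r
set.seed(42)
x <- scale(matrix(rnorm(90), ncol = 3))  # toy stand-in: 30 observations
km <- kmeans(x, centers = 3, nstart = 25)

km$size     # number of observations per cluster
km$centers  # cluster centers (means) for each variable
km$cluster  # cluster assignment for each observation
```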
We can also view our results by using `fviz_cluster`. This provides a nice illustration of the clusters. If there are more than two dimensions (variables), `fviz_cluster` will perform principal component analysis (PCA) and plot the data points according to the first two principal components, which explain the majority of the variance.
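A sketch of that call (this assumes the factoextra package is installed; `km` and `x` are toy stand-ins, as above, rather than the objects from this post):

```r
library(factoextra)  # provides fviz_cluster()

set.seed(123)
x <- scale(matrix(rnorm(60), ncol = 3))  # 3 variables -> PCA projection
km <- kmeans(x, centers = 3, nstart = 25)

# Plot the clusters on the first two principal components
fviz_cluster(km, data = x)
```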